Mymensingh Division
Gemini Embedding: Generalizable Embeddings from Gemini
Lee, Jinhyuk, Chen, Feiyang, Dua, Sahil, Cer, Daniel, Shanbhogue, Madhuri, Naim, Iftekhar, Ábrego, Gustavo Hernández, Li, Zhe, Chen, Kaifeng, Vera, Henrique Schechter, Ren, Xiaoqi, Zhang, Shanfeng, Salz, Daniel, Boratko, Michael, Han, Jay, Chen, Blair, Huang, Shuo, Rao, Vikram, Suganthan, Paul, Han, Feng, Doumanoglou, Andreas, Gupta, Nithi, Moiseev, Fedor, Yip, Cathy, Jain, Aashi, Baumgartner, Simon, Shahi, Shahrokh, Gomez, Frank Palma, Mariserla, Sandeep, Choi, Min, Shah, Parashar, Goenka, Sonam, Chen, Ke, Xia, Ye, Chen, Koert, Duddu, Sai Meher Karthik, Chen, Yichang, Walker, Trevor, Zhou, Wenlei, Ghiya, Rakesh, Gleicher, Zach, Gill, Karan, Dong, Zhe, Seyedhosseini, Mojtaba, Sung, Yunhsuan, Hoffmann, Raphael, Duerig, Tom
Embedding models, which transform inputs into dense vector representations, are pivotal for capturing semantic information across various domains and modalities. Text embedding models represent words and sentences as vectors, strategically positioning semantically similar texts in close proximity within the embedding space (Gao et al., 2021; Le and Mikolov, 2014; Reimers and Gurevych, 2019). Recent research has focused on developing general-purpose embedding models capable of excelling in diverse downstream tasks, including information retrieval, clustering, and classification (Cer et al., 2018; Muennighoff et al., 2023). Leveraging their vast pre-training knowledge, large language models (LLMs) have emerged as a promising avenue for constructing such general-purpose embedding models, with the potential to significantly enhance performance across a broad spectrum of applications (Anil et al., 2023a,b; Brown et al., 2020). The integration of LLMs has revolutionized the development of high-quality embedding models through two primary approaches. Firstly, LLMs have been employed to refine training datasets by generating higher quality examples. Techniques such as hard negative mining (Lee et al., 2024) and synthetic data generation (Dai et al., 2022; Wang et al., 2023) enable the distillation of LLM knowledge into smaller, more efficient embedding models, leading to substantial performance gains. Secondly, recognizing that the embedding model parameters are frequently initialized from language models (Devlin et al., 2019; Karpukhin et al., 2020), researchers have explored leveraging LLM parameters directly for initialization (Ni et al., 2021).
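As a rough illustration of two ideas mentioned above (similar texts sitting close together in the embedding space, and hard negative mining), the following minimal sketch compares toy vectors by cosine similarity and picks the non-matching candidate closest to the query. The vectors, the candidate names, and the helper `hardest_negative` are hypothetical and chosen only for illustration; this is a generic similarity-based selection, not the LLM-based procedure of Lee et al. (2024) or the method used for Gemini Embedding.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hardest_negative(query, candidates, positive_id):
    """Return the id of the non-positive candidate most similar to the query.

    Hypothetical helper: a "hard" negative is one the model could easily
    confuse with the true match, so we pick the closest wrong candidate.
    """
    negatives = {cid: vec for cid, vec in candidates.items() if cid != positive_id}
    return max(negatives, key=lambda cid: cosine_similarity(query, negatives[cid]))


# Toy 3-dimensional "embeddings" (assumed values, not model outputs).
query = [1.0, 0.0, 0.2]
candidates = {
    "positive": [0.9, 0.1, 0.3],   # the true match for the query
    "near_miss": [0.8, 0.5, 0.0],  # related but wrong
    "unrelated": [0.0, 1.0, 0.0],  # orthogonal to the query
}

for cid, vec in candidates.items():
    print(cid, round(cosine_similarity(query, vec), 3))

print("hard negative:", hardest_negative(query, candidates, "positive"))
```

In a real training setup the vectors would come from an embedding model, and the selected negative would be paired with the query and its positive in a contrastive loss.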
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > India > Andhra Pradesh (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
Crime Prediction using Machine Learning with a Novel Crime Dataset
Shohan, Faisal Tareque, Akash, Abu Ubaida, Ibrahim, Muhammad, Alam, Mohammad Shafiul
Crime is an unlawful act that carries legal repercussions. Bangladesh has a high crime rate due to poverty, population growth, and many other socio-economic issues. For law enforcement agencies, understanding crime patterns is essential for preventing future criminal activity. For this purpose, these agencies need a structured crime database. This paper introduces a novel crime dataset that contains temporal, geographic, weather, and demographic data about 6574 crime incidents in Bangladesh. We manually gather crime news articles spanning seven years from a daily newspaper archive and extract basic features from this raw text. Using these basic features, we then consult standard providers of geo-location and weather data to obtain the corresponding information for each collected crime incident. Furthermore, we collect demographic information from Bangladesh National Census data. All of this information is combined into a standard machine learning dataset. In total, 36 features are engineered for the crime prediction task. Five supervised machine learning classification algorithms are then evaluated on this newly built dataset, and satisfactory results are achieved. We also conduct exploratory analysis on various aspects of the dataset. This dataset is expected to serve as the foundation for crime incidence prediction systems for Bangladesh and other countries. The findings of this study will help law enforcement agencies forecast and contain crime, and allocate resources optimally for crime patrol and prevention.
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Canada (0.04)